Natural Language and Image-Based Search Support for Recordings#603

Open
WANGXIAOMIN-HIK wants to merge 11 commits into development from video/NLSearch&ImageSearch

Conversation

@WANGXIAOMIN-HIK

To enhance ONVIF's search capabilities, the following operations have been added to support natural language and image-based search for video recordings:

FindImagebyNL
Purpose: Starts a search session using natural language descriptions to locate relevant video recordings.
Example Query: "Person wearing a red hat."
Parameters:
StartPoint: Start time for the search.
EndPoint: End time for the search.
RecordingToken: (Optional) Token for the recording to search.
Text: Natural language description for the search.
Likelihood: (Optional) Similarity threshold for the search (0~1).
MaxMatches: (Optional) Maximum number of matches to return.
KeepAliveTime: Time the search session will be kept alive.
Response:
SearchToken: A unique reference to the search session.
GetNLSearchResults
Purpose: Retrieves results from a natural language search session initiated by FindImagebyNL.
Parameters:
SearchToken: Token identifying the search session.
MinResults: (Optional) Minimum number of results to return.
MaxResults: (Optional) Maximum number of results to return.
WaitTime: (Optional) Maximum time to wait for results.
Response:
ResultList: List of matching results, including metadata such as TargetImageURI, Time, Likelihood, and RecordingToken.
FindImagebyImage
Purpose: Starts a search session using a target image to locate relevant video recordings.
Parameters:
StartPoint: Start time for the search.
EndPoint: End time for the search.
RecordingToken: (Optional) Token for the recording to search.
TargetImageURI: URI of the target image to be searched.
MaxMatches: (Optional) Maximum number of matches to return.
KeepAliveTime: Time the search session will be kept alive.
Response:
SearchToken: A unique reference to the search session.
GetImageSearchResults
Purpose: Retrieves results from an image-based search session initiated by FindImagebyImage.
Parameters:
SearchToken: Token identifying the search session.
MinResults: (Optional) Minimum number of results to return.
MaxResults: (Optional) Maximum number of results to return.
WaitTime: (Optional) Maximum time to wait for results.
Response:
ResultList: List of matching results, including metadata such as TargetImageURI, Time, Likelihood, and RecordingToken.
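The session flow described above (start a search, receive a SearchToken, then fetch results in batches) can be sketched as follows. This is only an illustration under stated assumptions: `FakeSearchDevice`, `collect_all`, and the snake_case method names are hypothetical stand-ins, not an ONVIF client library, and the fake does no real language matching. It also assumes that results already returned are not returned again for the same session.

```python
import itertools
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class FindImageResult:
    # Fields mirror the result metadata listed in the description.
    target_image_uri: str
    time: datetime
    likelihood: float
    recording_token: str


class FakeSearchDevice:
    """Hypothetical in-memory stand-in for a device (illustration only)."""

    def __init__(self, indexed: List[FindImageResult]) -> None:
        self._indexed = indexed
        self._sessions: Dict[str, List[FindImageResult]] = {}
        self._ids = itertools.count(1)

    def find_image_by_nl(self, start_point: datetime, end_point: datetime,
                         text: str, likelihood: float = 0.0,
                         max_matches: Optional[int] = None) -> str:
        # A real device would match `text` against its index; this fake
        # only applies the time range and the likelihood threshold.
        hits = [r for r in self._indexed
                if start_point <= r.time <= end_point
                and r.likelihood >= likelihood]
        if max_matches is not None:
            hits = hits[:max_matches]
        token = f"search-{next(self._ids)}"
        self._sessions[token] = hits
        return token

    def get_nl_search_results(self, search_token: str,
                              max_results: Optional[int] = None
                              ) -> List[FindImageResult]:
        # Hand out the next batch and drop it from the pending list, so
        # the same result is never returned twice for one session.
        pending = self._sessions[search_token]
        n = len(pending) if max_results is None else max_results
        batch, self._sessions[search_token] = pending[:n], pending[n:]
        return batch


def collect_all(device: FakeSearchDevice, search_token: str,
                page_size: int = 2) -> List[FindImageResult]:
    # The fake treats an empty batch as the end of the search; a real
    # device may signal completion differently.
    results: List[FindImageResult] = []
    while True:
        batch = device.get_nl_search_results(search_token, max_results=page_size)
        if not batch:
            return results
        results.extend(batch)
```

A caller would invoke `find_image_by_nl` once with the query text (for example "Person wearing a red hat") and then loop over `get_nl_search_results` with the returned token.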
Schema Updates
onvif.xsd:

Added complex types for FindImageResult and FindImageResultList to support result structures for both natural language and image-based searches.
Included fields like TargetImageURI, Time, Likelihood, and RecordingToken.
search.wsdl:

Defined operations FindImagebyNL, GetNLSearchResults, FindImagebyImage, and GetImageSearchResults.
Added request and response elements for each operation.
Documentation Updates
RecordingSearch.xml:
Added detailed descriptions for FindImagebyNL and GetNLSearchResults operations, explaining their purpose, parameters, and responses.
…search

Updated the document and WSDL definitions to allow multiple recording tokens to be passed in a search operation, so that multiple recordings can be queried at the same time.
<xs:element name="EndPoint" type="xs:dateTime">
<xs:annotation><xs:documentation>End time for the search.</xs:documentation></xs:annotation>
</xs:element>
<xs:element name="RecordingToken" type="tt:RecordingReference" minOccurs="0" maxOccurs="unbounded">
Contributor

Add maxOccurs="unbounded", which will allow more than one recording container to be searched.

Author

Yes, thank you for your opinion. In the last meeting, one of the judges expressed the desire to support searching multiple recording containers.
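To illustrate what minOccurs="0" maxOccurs="unbounded" means for a client, here is a minimal sketch that builds a request body with zero or more RecordingToken elements. The root element name `FindImagebyNL` and the helper `build_nl_request` are assumptions made for illustration; the child element names follow the parameters listed in the PR description, and the namespace is the ONVIF search service namespace.

```python
import xml.etree.ElementTree as ET

# ONVIF search service namespace (conventionally prefixed "tse").
TSE = "http://www.onvif.org/ver10/search/wsdl"


def build_nl_request(start, end, text, recording_tokens=()):
    """Build a request element with any number of RecordingToken children."""
    # The root element name is an assumption for illustration only.
    root = ET.Element(f"{{{TSE}}}FindImagebyNL")
    ET.SubElement(root, f"{{{TSE}}}StartPoint").text = start
    ET.SubElement(root, f"{{{TSE}}}EndPoint").text = end
    # maxOccurs="unbounded": one element per recording to search.
    # minOccurs="0": an empty sequence simply emits no RecordingToken.
    for token in recording_tokens:
        ET.SubElement(root, f"{{{TSE}}}RecordingToken").text = token
    ET.SubElement(root, f"{{{TSE}}}Text").text = text
    return root


request = build_nl_request(
    "2025-07-01T00:00:00Z", "2025-07-02T00:00:00Z",
    "Person wearing a red hat", ["rec1", "rec2"],
)
print(ET.tostring(request, encoding="unicode"))
```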

<xs:element name="EndPoint" type="xs:dateTime">
<xs:annotation><xs:documentation>End time for the search.</xs:documentation></xs:annotation>
</xs:element>
<xs:element name="RecordingToken" type="tt:RecordingReference" minOccurs="0" maxOccurs="unbounded">
<xs:annotation><xs:documentation>This element contains a list of recording tokens to search.</xs:documentation></xs:annotation>
</xs:element>
<xs:element name="TargetImageURI" type="xs:anyURI">
<xs:annotation><xs:documentation>The target image to be searched in LocalStorage URI format.</xs:documentation></xs:annotation>
Contributor

How does the client get this local storage URI to search?

Author

Yes, thank you for your opinion. The TargetImageURI is the result returned from SearchImageByNL.

Contributor

Can't we use the SearchImagebyImage request without getting a result from SearchImageByNL? I feel SearchImagebyImage and SearchImageByNL are independent search sessions, i.e. one is an image-based search and the other is a text-based search.

Contributor

@WANGXIAOMIN-HIK Yes, I agree with @venki5685. I feel there should not be a dependence on either API.

Author

Thank you for your opinion. @venki5685 @kieran242
We have thought about it carefully: SearchImagebyImage and SearchImageByNL are independent search sessions.
TargetImageURI can be a local URI or a remote URI.

</xs:element>

<!-- Define FindImagebyImage -->
<xs:element name="FindImagebyImageRequest">
Contributor

FindImagebyImage name can be revisited.

Author

Yes, thank you for your opinion. We have changed FindImageByImage to SearchImageByImage, and FindImageByNL to SearchImageByNL.

Author

Please pay attention to #603 5b48ba2, the other two records are bad commits (dff6a6f and 57a89d5).

WANGXIAOMIN-HIK added a commit to WANGXIAOMIN-HIK/specs that referenced this pull request Jul 14, 2025
1.change the FindImageByImage to SearchImageByImage, and FindImageByNL to SearchImageByNL.
2.memo the TargetImageURI is the result returned from SearchImageByNL.
@kieran242
Contributor

@WANGXIAOMIN-HIK

Is this functionality aimed at a Camera device or Network Video Recorder or Both?
How is the device supporting NL?

@WANGXIAOMIN-HIK
Author

Yes, thank you for your opinion. @kieran242

Yes, both. This functionality is aimed at both camera devices and Network Video Recorders.
To implement this function on a camera device, the device needs storage space for images, for example an installed SD card.

The device implements the following algorithm. During training, the algorithm leverages massive numbers of annotated image-text pairs: visual features (e.g., "dog", "snow", "yellow", "fur") and textual elements (e.g., tokenized "snow/field/dog") are extracted through cross-modal neural networks, forming the foundation of its text-to-image retrieval model. In deployment, the system processes video streams to detect targets, then employs on-device models to generate and store binary-encoded feature vectors for rapid matching.
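As a rough illustration of why binary-encoded feature vectors allow rapid matching, the sketch below compares codes with a bitwise XOR and a bit count. Using Hamming distance is an assumption made here for illustration; the comment above does not say which comparison the device actually uses, and the function names and sample codes are hypothetical.

```python
from typing import Dict, List, Tuple


def hamming_similarity(a: int, b: int, bits: int) -> float:
    """Fraction of matching bits between two `bits`-wide binary codes."""
    mask = (1 << bits) - 1
    differing = bin((a ^ b) & mask).count("1")
    return 1.0 - differing / bits


def top_matches(query: int, stored: Dict[str, int], bits: int,
                k: int) -> List[Tuple[float, str]]:
    """Return the k stored codes most similar to the query code."""
    scored = [(hamming_similarity(query, code, bits), name)
              for name, code in stored.items()]
    # Highest similarity first; ties broken by name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


# Hypothetical 4-bit codes; real feature codes would be far wider.
stored_codes = {"img-a": 0b1111, "img-b": 0b0000, "img-c": 0b1110}
print(top_matches(0b1111, stored_codes, bits=4, k=2))
```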

@kieran242
Contributor

@WANGXIAOMIN-HIK very kind thanks for your response. It was very informative.

</section>

<section>
<title>SerachImagebyImage</title>
Contributor

SerachImagebyImage -> SearchImagebyImage

Author

Thank you for your clear feedback. I have now fixed the issue. I appreciate your help.

Contributor

@kieran242 kieran242 left a comment

@WANGXIAOMIN-HIK a few minor updates as suggestions. The WSDL update is good, but there is a spelling mistake in the doc.


<section>
<title>SerachImagebyImage</title>
<para>SerachImagebyImage starts a search session, looking for video records based on a provided image. Results from the search are acquired using the GetImageSearchResults request, specifying the search token returned from this request.</para>
Contributor

Suggested change
<para>SerachImagebyImage starts a search session, looking for video records based on a provided image. Results from the search are acquired using the GetImageSearchResults request, specifying the search token returned from this request.</para>
<para>SearchImagebyImage starts a search session, looking for video records based on a provided image. Results from the search are acquired using the GetImageSearchResults request, specifying the search token returned from this request.</para>

</section>
<section>
<title>GetImageSearchResults</title>
<para>GetImageSearchResults acquires the results from an image-based search session previously initiated by a SerachImagebyImage operation. The response shall not include results already returned in previous requests for the same session.</para>
Contributor

Suggested change
<para>GetImageSearchResults acquires the results from an image-based search session previously initiated by a SerachImagebyImage operation. The response shall not include results already returned in previous requests for the same session.</para>
<para>GetImageSearchResults acquires the results from an image-based search session previously initiated by a SearchImagebyImage operation. The response shall not include results already returned in previous requests for the same session.</para>

Author

Thank you very much for your advice. I have fixed the spelling error. I appreciate your help.

<para role="text">The point of time where the search will stop.</para>
<para role="param">RecordingToken - optional [tt:RecordingReference]</para>
<para role="text">Token for the recording to search.</para>
<para role="param">TargetImageURI [xs:anyURI]</para>
Contributor

Add a search parameter that accepts an external image from the client in addition to the NPL target image URI. The client can then use either NPLTargetImageURI or an external image for the image search feature.

Author

We have updated the search functionality and added the TargetImageData parameter.
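A client-side check for the two image inputs might look like the sketch below. It assumes that exactly one of TargetImageURI and TargetImageData is supplied per request and that TargetImageData carries raw image bytes; neither assumption is confirmed in this thread, and `choose_image_source` is a hypothetical helper.

```python
from typing import Optional, Tuple, Union


def choose_image_source(
    target_image_uri: Optional[str] = None,
    target_image_data: Optional[bytes] = None,
) -> Tuple[str, Union[str, bytes]]:
    """Return which image input to send, rejecting ambiguous requests."""
    # Assumption: the two inputs are mutually exclusive and one is required.
    if (target_image_uri is None) == (target_image_data is None):
        raise ValueError(
            "provide exactly one of TargetImageURI or TargetImageData")
    if target_image_uri is not None:
        return ("uri", target_image_uri)
    return ("data", target_image_data)
```

If the schema ends up allowing both inputs at once, the check would need to be relaxed accordingly.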

…nd provide detailed explanation of the use of TargetImageURI
<para role="param">TargetImageURI [xs:anyURI]</para>
<para role="text">The TargetImageURI is the result returned from SearchImageByNL.</para>
<para role="param">TargetImageURI - optional [xs:anyURI]</para>
<para role="text">The URI of the detected target object image. This can be either: - a local image stored in the NPL Target Image repository (LocalStorage format), or - an external image provided by the client for image search or feature matching.</para>
Contributor

@WANGXIAOMIN-HIK please add an entry for "NPL" to this document's "Definitions" in section 3.1 to explain that it is "Natural Language Processing". It will add clarity to the document.

… the terminology, as the original meaning refers to the images stored internally on the device.
dstafx previously requested changes Nov 17, 2025
@ocampana-videotec ocampana-videotec added this to the 26.06 milestone Dec 4, 2025
The description: It represents the cosine similarity between two vectors, which measures how similar the directions of the vectors are. The closer the value is to 1, the higher the similarity; the closer the value is to 0, the lower the similarity.
<para role="text">Token for the recording to search.</para>
<para role="param">Text [xs:string]</para>
<para role="text">Natural language description for the search.</para>
<para role="param">CosineSimilarity - optional [xs:float]</para>
Member

CosineSimilarity is just one out of a set of common similarity measures.
I can imagine that ONVIF just defines an abstract similarity between zero and one, or a complex item supporting multiple similarity measures.

For the sake of simplicity I prefer the first approach, as the second one would require a set of capabilities indicating which ones a device supports.

Author

Yes,
cosine similarity is just one of the most commonly used similarity measures in vector space. In actual image retrieval and similarity assessment, multiple distance measures, matching based on local features, perceptual similarity, as well as learned metrics or hash/quantization indexing are also used.
Could we consider changing the field back to 'similarity', and note in the comments that we are currently using the cosine vector method? @dstafx
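For reference, cosine similarity as discussed here can be computed as in the following minimal sketch. Note that the value falls in the range 0 to 1 only when the vector components are non-negative; for arbitrary vectors it ranges from -1 to 1. The function name is illustrative and not part of the proposal.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


# Vectors pointing the same way score close to 1; orthogonal ones score 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```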


OK for me, since this is a big and important topic that we likely need to come back to on a more general level. But to explain a bit more:

My general concern (and the industry challenge) is that if we are too generic, it will not be useful across vendors, as the implementation scores would not be comparable. If it is a generic number, the message to the client is likely that every endpoint where this interface is offered may have a different implementation, so the similarity is not comparable between different endpoints. So when searching, some devices or vendors may consistently report higher similarities, thereby potentially hiding relevant results. By stating that the similarity is only for sorting search results from a single endpoint, we can avoid this.

Author

Thank you very much for your detailed explanation. After careful consideration, I still think it should be defined as similarity, as we cannot restrict the vendors' implementation algorithms.

However, regarding the issue you mentioned about cross-device and cross-vendor search, we can add a note stating that this similarity is only guaranteed to be effective for ranking within the same search result returned by the same endpoint, and it does not guarantee that similarity can be compared across different vendors, devices, or endpoints.

As for the cross-device search issue, we can discuss it in our next meeting. This would require imposing constraints on the hardware vendors' implementation mechanisms, such as ensuring that devices support the same algorithm scheduling to guarantee consistency in device detection mechanisms.
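The proposed note (similarity is only meaningful for ranking results from the same endpoint) suggests that a client should sort per endpoint rather than merge scores globally. A minimal sketch, with hypothetical endpoint names and a hypothetical `rank_per_endpoint` helper:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Result = Tuple[str, str, float]  # (endpoint, image URI, similarity)


def rank_per_endpoint(results: Iterable[Result]) -> Dict[str, List[Tuple[str, float]]]:
    """Sort results by similarity within each endpoint, never across them."""
    grouped: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for endpoint, uri, similarity in results:
        grouped[endpoint].append((uri, similarity))
    # Scores from different endpoints are not compared with each other.
    return {
        endpoint: sorted(items, key=lambda item: item[1], reverse=True)
        for endpoint, items in grouped.items()
    }


ranked = rank_per_endpoint([
    ("nvr-1", "img-a", 0.62),
    ("cam-7", "img-b", 0.95),
    ("nvr-1", "img-c", 0.71),
])
print(ranked)
```

A user interface built on this would present one ranked list per endpoint, which avoids a vendor with systematically higher scores pushing other devices' relevant results out of view.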

@dstafx dstafx dismissed their stale review January 15, 2026 10:09

Don't want to have this blocking

@kieran242
Contributor

kieran242 commented Jan 23, 2026

@WANGXIAOMIN-HIK @dstafx @HansBusch is this issue regarding the "similarity measures" resolved? I see it is required for IPR review.

Contributor

@kieran242 kieran242 left a comment

@WANGXIAOMIN-HIK approved with discussion on VE WG Call.

@sujithhanwha
Contributor

It appears that both APIs, GetImageSearchResults and GetNLSearchResults, currently return identical results.

We could either consolidate them into a single unified method for search results,
or,
if we prefer to keep them separate, define a distinct structure for NL search (e.g., FindObjectImageResultListFindNLResultList) and adjust the response format to differentiate them. For instance, in GetNLSearchResults, the Image field could be optional when the search is based purely on metadata/text inference.

@WANGXIAOMIN-HIK
Author

GetNLSearchResults and GetImageSearchResults are distinguished by their usage scenarios.
GetNLSearchResults is used for searching images based on natural language, retrieving images that closely match the natural language description.
GetImageSearchResults, on the other hand, searches for images using an image as input. It is based on image modeling and can accurately find the corresponding image information.
Users first use GetNLSearchResults for preliminary retrieval, then select the images they want, and use GetImageSearchResults for precise searching.
These two interfaces essentially both search for images, so they adopt the same structure.
@sujithhanwha

@sujithhanwha
Contributor

GetNLSearchResults and GetImageSearchResults are distinguished by their usage scenarios. GetNLSearchResults is used for searching images based on natural language, retrieving images that closely match the natural language description. GetImageSearchResults, on the other hand, searches for images using an image as input. It is based on image modeling and can accurately find the corresponding image information. Users first use GetNLSearchResults for preliminary retrieval, then select the images they want, and use GetImageSearchResults for precise searching. These two interfaces essentially both search for images, so they adopt the same structure. @sujithhanwha

@WANGXIAOMIN-HIK,
My question is not about the usage scenarios; I fully agree with those distinctions. It's purely about the interface design. If both GetImageSearchResults and GetNLSearchResults return the same format, why do we need two separate methods? Could we not unify them into a single method that accepts a search token from both image and NL search?

If you believe it makes sense to keep them separate, I'd recommend using different result formats. This would allow each interface to evolve independently. (For example, if we later add parameters specific to image-based search, they wouldn't automatically apply to natural language search results.)

… of the interface.

We have added FindNLSearchResultList and FindNLSearchResult to distinguish the return values of the GetNLSearchResults and GetImageSearchResults interfaces.
GetNLSearchResults -> FindNLSearchResultList, FindNLSearchResult
GetImageSearchResults -> FindObjectImageResultList, FindObjectImageResult
@WANGXIAOMIN-HIK
Author

Thank you for your suggestion. @sujithhanwha
We have made the following changes considering the future scalability of the interface.
We have added FindNLSearchResultList and FindNLSearchResult to distinguish the return values of the GetNLSearchResults and GetImageSearchResults interfaces.
GetNLSearchResults -> FindNLSearchResultList, FindNLSearchResult
GetImageSearchResults -> FindObjectImageResultList, FindObjectImageResult
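The benefit of separate result types can be sketched with two small data classes. The field sets below are assumptions for illustration (the actual schema is defined in the PR's onvif.xsd changes); the optional image field on the NL result follows the reviewer's earlier suggestion.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class FindObjectImageResult:
    """Result item for GetImageSearchResults (fields are assumed)."""
    recording_token: str
    time: datetime
    target_image_uri: str
    similarity: float


@dataclass
class FindNLSearchResult:
    """Result item for GetNLSearchResults (fields are assumed)."""
    recording_token: str
    time: datetime
    similarity: float
    # Optional here so a purely text/metadata-based match can omit it.
    target_image_uri: Optional[str] = None


when = datetime(2025, 7, 14, 9, 30, tzinfo=timezone.utc)
nl_hit = FindNLSearchResult("rec1", when, 0.7)
image_hit = FindObjectImageResult("rec1", when, "uri:target-1", 0.9)
print(nl_hit, image_hit)
```

Because the two types are independent, a field added later to one of them does not change the other.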

8 participants